The Bayesian analysis will focus on the UCI Heart Failure Clinical Records dataset, which contains medical records of patients who experienced heart failure. These records were collected during each patient's follow-up period, and we will use them to predict survival. Each patient has 12 recorded features, with a 13th column serving as the target variable: death_event.
Dataset: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records
The dataset consists of 12 features plus the target:
age: age of the patient in years
creatinine_phosphokinase: level of the CPK enzyme in the blood, in mcg/L
anemia: decrease of red blood cells or hemoglobin (binary)
diabetes: whether the patient has diabetes (binary)
ejection_fraction: percentage of blood leaving the heart at each contraction
high_blood_pressure: whether the patient has high blood pressure (binary)
platelets: platelets in the blood, in kiloplatelets/mL
serum_creatinine: level of serum creatinine in the blood, in mg/dL
serum_sodium: level of serum sodium in the blood, in mEq/L
smoking: whether the patient smokes (binary)
time: follow-up period, in days
death_event: whether the patient died during the follow-up period (binary, the target)
sex: man or woman (binary)
heart_failure_data <- read.csv("heart_failure_clinical_records_dataset.csv")
heart_failure_data <- clean_names(heart_failure_data)
heart_failure_data <- heart_failure_data |>
rename(anemia = anaemia)
head(heart_failure_data)
## age anemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## death_event
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
dim(heart_failure_data)
## [1] 299 13
There are 299 rows and 13 columns, one of which is the response variable, death_event. We are going to predict whether or not the patient died based on the other features.
The project will apply several methods to predict the death_event variable:
1. Logistic Regression
2. Naive Bayes Classifier
3. Random Forest Classifier
I will also investigate whether the continuous variable time (the follow-up period) is related to other variables, using the following methods:
1. Linear Regression
2. Linear Regression (with an interaction term)
summary(heart_failure_data)
## age anemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## death_event
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
heart_failure_data |>
summarise_all(~sum(is.na(.)))
## age anemia creatinine_phosphokinase diabetes ejection_fraction
## 1 0 0 0 0 0
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 0 0 0 0 0 0 0
## death_event
## 1 0
No missing records, so we can take a look at the individual variables. First, let's look at a correlation matrix between all the variables so we can pick features that will be good predictors for both of our target variables.
corr_matrix <- cor(heart_failure_data, use = "complete.obs")
corrplot(corr_matrix, method = "circle")
From the correlation matrix, we can identify a few variables of interest when looking at the time variable. There appears to be weak correlation between time and anemia, age, high blood pressure, and serum creatinine. Based on this, let's use the high blood pressure and age variables to see if we can predict the follow-up period. Serum creatinine is also contextually relevant: it is a blood test that measures how well blood is flowing through the kidneys and whether they are functioning well. Although this may not be immediately relevant to clinical heart failure, impaired blood flow to the kidneys and through the urinary tract may indicate problems in the cardiovascular system.
Further, death_event appears most strongly correlated with time, serum_creatinine, ejection_fraction, age, and serum_sodium. We will use these variables in our analysis and prediction of death_event. The death_event variable is the target and is relevant in context because it tells us whether patients who had heart failure survived their follow-up period.
We are interested in whether there is a relationship between the follow-up period and the age of the patient, accounting for whether the patient has high blood pressure.
I will start with a linear regression model with no interaction term. I will use default weakly informative priors, since I do not have any prior knowledge regarding the distribution of follow-up periods for clinical heart failure patients.
First let us do a preliminary regression graph and see the correlation:
heart_failure_data |>
  ggplot(aes(x = age, y = time, color = factor(high_blood_pressure))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "age of patient", y = "follow up period (days)", color = "high blood pressure")
## `geom_smooth()` using formula = 'y ~ x'
The correlation seems quite weak: age versus time looks close to random noise. High blood pressure also appears weakly related; many of the clinical heart failure patients have high blood pressure, and there is no distinct split between the groups.
main_model <- stan_glm(
time ~ age + high_blood_pressure,
data = heart_failure_data,
family = gaussian,
prior = normal(0,2.5, autoscale = TRUE),
prior_intercept = normal(0,2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4,
iter = 2 * 5000,
seed = 84375
)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000363 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 3.63 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.247153 seconds (Warm-up)
## Chain 1: 0.413184 seconds (Sampling)
## Chain 1: 0.660337 seconds (Total)
## Chain 1:
prior_summary(main_model)
## Priors for model 'main_model'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 0, scale = 2.5)
## Adjusted prior:
## ~ normal(location = 0, scale = 194)
##
## Coefficients
## Specified prior:
## ~ normal(location = [0,0], scale = [2.5,2.5])
## Adjusted prior:
## ~ normal(location = [0,0], scale = [ 16.31,405.82])
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.013)
## ------
## See help('prior_summary.stanreg') for more details
Now let’s check the MCMC diagnostics:
mcmc_trace(main_model)
mcmc_dens_overlay(main_model)
mcmc_acf(main_model)
neff_ratio(main_model)
## (Intercept) age high_blood_pressure sigma
## 1.21080 1.19365 1.35060 1.34155
rhat(main_model)
## (Intercept) age high_blood_pressure sigma
## 0.999981 0.999989 1.000040 1.000048
The numerical and graphical MCMC diagnostics of the no-interaction model look good: the trace plots mix well, the effective sample size ratios are high, and the R-hat values are approximately 1.
Now we will look at the tidy summary of the posterior variables:
tidy(main_model, c("fixed", "aux"), conf.int = TRUE, conf.level = 0.80 )
## # A tibble: 5 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 223. 22.6 194. 251.
## 2 age -1.36 0.363 -1.82 -0.893
## 3 high_blood_pressure -28.7 9.14 -40.4 -17.1
## 4 sigma 74.7 3.09 71.0 78.9
## 5 mean_PPD 130. 6.12 122. 138.
Interpreting the tidy summary: holding high blood pressure constant, each additional year of age is associated with a follow-up period that is, on average, about 1.36 days shorter. Patients with high blood pressure have a follow-up period that is, on average, about 28.7 days shorter than that of patients without high blood pressure. When both predictors are 0, the average follow-up period is about 223 days (although age = 0 is far outside the observed range, so the intercept is not directly interpretable). Note that the 80% credible intervals for both coefficients exclude 0.
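As a quick sanity check on these interpretations, we can plug the posterior-mean coefficients from the tidy summary into the linear predictor by hand (a rough point estimate only, not a full posterior):

```r
# Posterior-mean coefficients copied from the tidy summary above
intercept <- 223
b_age     <- -1.36
b_hbp     <- -28.7

# Expected follow-up (days) for a given age and high-blood-pressure status
expected_time <- function(age, high_blood_pressure) {
  intercept + b_age * age + b_hbp * high_blood_pressure
}

expected_time(60, 0)  # patient of roughly average age, no high blood pressure
expected_time(60, 1)  # same age, with high blood pressure: 28.7 days shorter
```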
Making posterior predictions using the main model: I want to find the posterior predictive distribution for a heart failure patient who is 50 years old and has high blood pressure.
new_data <- data.frame(
age = 50,
high_blood_pressure = 1
)
posterior_prediction_model <- posterior_predict(main_model, newdata = new_data)
mcmc_areas(posterior_prediction_model) +
ggtitle("Distribution of Posterior Predictive Model for a 50 year old with high blood pressure") +
xlab("Follow up period (days)") +
scale_x_continuous(
breaks = seq(-200, 400, by = 50) # Define the range and interval for breaks
)
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
The distribution shows that the average follow-up is roughly 126 days. We can investigate a little more by looking at the mean, median, and standard deviation of the posterior predictions:
posterior_data <- data.frame(prediction = posterior_prediction_model[, 1])
posterior_data |>
  summarize(
    mean = mean(prediction),
    median = median(prediction),
    sd = sd(prediction)
  )
## mean median sd
## 1 125.9397 125.9047 75.72613
The mean predicted follow-up for a patient who is 50 years old with high blood pressure is about 125.9 days, with a large standard deviation of about 75.7 days.
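To convey that uncertainty more concretely, we can sketch a rough 80% predictive interval from the summary values above using a normal approximation (quantiles of the actual posterior predictive draws would be preferable; this is only an illustration of the spread):

```r
# Summary values taken from the posterior predictions above
mean_pred <- 125.9
sd_pred   <- 75.7

# Approximate 80% predictive interval under a normal approximation
mean_pred + qnorm(c(0.10, 0.90)) * sd_pred  # roughly 29 to 223 days
```

An interval this wide, nearly spanning the whole observed range of follow-up times, is another sign that age and high blood pressure alone predict the follow-up period poorly.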
Let’s see whether adding an interaction term between age and high blood pressure would benefit our analysis. The current predictions seem very weak, and adding this term may improve the fit.
I will make the same prior assumptions as in the previous model, since we have no prior knowledge.
interaction_regression <- stan_glm(
time ~ age + high_blood_pressure + high_blood_pressure:age,
data = heart_failure_data,
family = gaussian,
prior = normal(0,2.5, autoscale = TRUE),
prior_intercept = normal(0,2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4,
iter = 2 * 5000,
seed = 84375
)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.35 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.996071 seconds (Warm-up)
## Chain 1: 1.25245 seconds (Sampling)
## Chain 1: 2.24852 seconds (Total)
## Chain 1:
prior_summary(interaction_regression)
## Priors for model 'interaction_regression'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 0, scale = 2.5)
## Adjusted prior:
## ~ normal(location = 0, scale = 194)
##
## Coefficients
## Specified prior:
## ~ normal(location = [0,0,0], scale = [2.5,2.5,2.5])
## Adjusted prior:
## ~ normal(location = [0,0,0], scale = [ 16.31,405.82, 6.35])
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.013)
## ------
## See help('prior_summary.stanreg') for more details
Now let’s check the MCMC diagnostics again:
mcmc_trace(interaction_regression)
mcmc_dens_overlay(interaction_regression)
mcmc_acf(interaction_regression)
neff_ratio(interaction_regression)
## (Intercept) age high_blood_pressure
## 0.47405 0.47105 0.36095
## age:high_blood_pressure sigma
## 0.35875 0.70595
rhat(interaction_regression)
## (Intercept) age high_blood_pressure
## 1.000081 1.000084 1.000017
## age:high_blood_pressure sigma
## 0.999953 1.000076
The results are good, although the effective sample size ratios are a little lower than in the previous model. Later we will compare the two models using the same posterior prediction.
First, we can look at the plausibility of an interaction graphically and see whether we may need it:
heart_failure_data |>
  add_epred_draws(interaction_regression, ndraws = 50) |>
  ggplot(aes(x = age, y = time, color = high_blood_pressure)) +
  geom_line(aes(y = .epred, group = paste(high_blood_pressure, .draw)), alpha = .1) +
  geom_point(size = 0.1) +
  ggtitle("50 Posterior Draws for the Clinical Heart Failure Model")
Although there is no very clear linear trend in predicting time, the draws suggest a fair amount of interaction between high blood pressure and age: the fitted slopes differ between the two groups. Including an interaction term could therefore make our predictions more accurate than the model without one. In context, this would suggest that the effect of high blood pressure is related to age, which is plausible, since the prevalence of high blood pressure increases with age.
tidy(interaction_regression, effects = c("fixed", "aux"), conf.int = TRUE, conf.level = 0.99)
## # A tibble: 6 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 208. 27.3 136. 278.
## 2 age -1.10 0.442 -2.25 0.0760
## 3 high_blood_pressure 18.5 48.1 -112. 140.
## 4 age:high_blood_pressure -0.762 0.771 -2.71 1.33
## 5 sigma 74.8 3.16 67.4 83.6
## 6 mean_PPD 130. 6.19 114. 146.
Based on the tidy summary, however, the interaction coefficient is only slightly negative, about -0.76, and the 99% credible interval ranges from -2.71 to 1.33, which includes 0. This suggests that the interaction between age and high blood pressure may have essentially no impact on model performance, despite the slightly negative point estimate.
We can also test whether the interaction term is needed: let’s find the probability that the interaction coefficient is greater than, and less than, zero.
library(posterior)
## This is posterior version 1.3.1
##
## Attaching package: 'posterior'
## The following objects are masked from 'package:rstan':
##
## ess_bulk, ess_tail
## The following object is masked from 'package:bayesplot':
##
## rhat
## The following objects are masked from 'package:stats':
##
## mad, sd, var
posterior_samples <- as.data.frame(interaction_regression)
prob_greater_than_zero <- mean(posterior_samples$`age:high_blood_pressure` > 0)
prob_less_than_zero <- mean(posterior_samples$`age:high_blood_pressure` < 0)
prob_greater_than_zero
## [1] 0.1641
prob_less_than_zero
## [1] 0.8359
We can interpret these probabilities as follows: if either tail probability were near 1 (0.95 or greater), we would have strong evidence that the interaction differs from zero. In both cases, however, the probability falls well short of 0.95. These results indicate that the interaction term is not really needed or significant in the model.
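One compact way to summarize this two-tailed check (a convention sometimes used for posterior sign probabilities, not a frequentist p-value) is to double the smaller tail probability:

```r
# Tail probabilities computed above
p_gt <- 0.1641
p_lt <- 0.8359

# Doubling the smaller tail gives a two-sided measure of evidence
2 * min(p_gt, p_lt)  # ~0.33, far from the small values that would suggest a clear sign
```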
Before model comparison, I will use the same predictor values, a 50-year-old with high blood pressure, to see whether the interaction term affects the posterior predictive distribution.
posterior_prediction_model_int <- posterior_predict(interaction_regression, newdata = new_data)
mcmc_areas(posterior_prediction_model_int) +
ggtitle("Distribution of Posterior Predictive Model (With Interaction) for a 50 year old with high blood pressure") +
xlab("Follow up period (days)") +
scale_x_continuous(
breaks = seq(-200, 400, by = 50) # Define the range and interval for breaks
)
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
posterior_data_int <- data.frame(prediction = posterior_prediction_model_int[, 1])
posterior_data_int |>
  summarize(
    mean = mean(prediction),
    median = median(prediction),
    sd = sd(prediction)
  )
## mean median sd
## 1 131.5065 131.8002 75.1679
Compared to the main model without an interaction term, the interaction model estimates a higher mean follow-up period of about 132 days versus about 126 days. We will do some model comparison to see which model fits the data better.
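As with the main model, we can check this against a hand calculation from the interaction model's posterior-mean coefficients (a rough point estimate only, not a full posterior):

```r
# Posterior-mean coefficients copied from the interaction model's tidy summary
intercept <- 208
b_age     <- -1.10
b_hbp     <- 18.5
b_int     <- -0.762

age <- 50
hbp <- 1
intercept + b_age * age + b_hbp * hbp + b_int * age * hbp  # ~133 days
```

This is close to the simulated posterior predictive mean of about 132 days above.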
We can do a posterior predictive check to see which of the models fits the simulations better.
pp_check(main_model) +
ggtitle("Posterior Predictive Check Main Model Predicting Follow-Up Period")
pp_check(interaction_regression) +
ggtitle("Posterior Predictive Check Interaction Model Predicting Follow-Up Period")
Based on both graphs, the posterior predictive check is quite poor. Both models appear to simulate a trough in the distribution that does not exist in the data. This means neither model reproduces the distribution of the response well, and the predictor variables are probably only weakly informative of the response variable, follow-up time.
It is clear that neither model performs particularly well, but for analysis and comparison we will evaluate them using LOO and cross-validation.
set.seed(84375)
cv_main <- prediction_summary_cv(model = main_model, data=heart_failure_data, k = 10)
cv_int <- prediction_summary_cv(model = interaction_regression, data = heart_failure_data, k = 10)
cv_main$folds
## fold mae mae_scaled within_50 within_95
## 1 1 77.54932 1.0461757 0.2333333 1.0000000
## 2 2 65.98484 0.8784974 0.3000000 1.0000000
## 3 3 78.99836 1.0529194 0.3000000 1.0000000
## 4 4 65.92003 0.8647420 0.4000000 1.0000000
## 5 5 42.78605 0.5567980 0.6000000 1.0000000
## 6 6 66.53690 0.8787333 0.3793103 1.0000000
## 7 7 61.67698 0.8153673 0.4000000 0.9333333
## 8 8 59.46183 0.7872379 0.3666667 1.0000000
## 9 9 64.76902 0.8589182 0.4000000 1.0000000
## 10 10 57.43885 0.7728510 0.4666667 0.9666667
cv_main$cv
## mae mae_scaled within_50 within_95
## 1 64.11222 0.851224 0.3845977 0.99
cv_int$folds
## fold mae mae_scaled within_50 within_95
## 1 1 74.80074 1.0059085 0.2666667 1.0000000
## 2 2 66.17776 0.8707696 0.3000000 1.0000000
## 3 3 69.23780 0.9276505 0.2666667 0.9666667
## 4 4 84.21564 1.1312862 0.2333333 1.0000000
## 5 5 57.39099 0.7556834 0.4000000 1.0000000
## 6 6 45.56893 0.5962527 0.6206897 0.9655172
## 7 7 61.39272 0.8148010 0.4000000 1.0000000
## 8 8 54.52216 0.7191157 0.4333333 1.0000000
## 9 9 77.48096 1.0286503 0.3666667 0.9666667
## 10 10 71.37116 0.9422927 0.4666667 1.0000000
cv_int$cv
## mae mae_scaled within_50 within_95
## 1 66.21589 0.8792411 0.3754023 0.9898851
Comparing the scaled MAE, we see that neither model performs very well; the predictors we picked are weakly informative of the response variable time. However, the scaled MAE of the original model is slightly better than that of the interaction model. Let’s use LOO to confirm the cross-validation results.
loo_main <- loo(main_model)
loo_main
##
## Computed from 20000 by 299 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1715.9 8.2
## p_loo 3.3 0.2
## looic 3431.9 16.3
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
loo_int <- loo(interaction_regression)
loo_int
##
## Computed from 20000 by 299 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1716.5 8.1
## p_loo 4.2 0.3
## looic 3432.9 16.3
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
loo_compare(loo_int,loo_main)
## elpd_diff se_diff
## main_model 0.0 0.0
## interaction_regression -0.5 0.9
The interaction model is slightly worse in elpd than the main model. Combined with its worse cross-validation performance, this means the interaction term did not have a positive impact on model performance, so we should use the main model. It is important to note, however, that the main model’s performance is also not good: its scaled MAE from cross-validation was about 0.85, meaning the typical prediction error was nearly as large as the posterior predictive standard deviation of follow-up time.
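We can also gauge how decisive the LOO comparison is by relating the elpd difference to its standard error (a common rule of thumb: a difference within about two standard errors of zero is not meaningful):

```r
# Values taken from the loo_compare() output above
elpd_diff <- -0.5
se_diff   <- 0.9

elpd_diff / se_diff  # ~ -0.56, well within two standard errors of zero
```

So the data do not distinguish meaningfully between the two models; we prefer the main model mainly for its simplicity.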
For classifying the death_event variable, I will apply one in-class technique and two out-of-class techniques, then compare the performance of the models.
The following methods will be applied:
1. Logistic Regression
2. Naive Bayes Classifier
3. Random Forest Algorithm
Since some of these models are not Bayesian, we will use a train/test split methodology, dividing the data into a training set and a test set.
set.seed(84375)
heart_failure_classification <- heart_failure_data |>
select(death_event, serum_creatinine, ejection_fraction)
head(heart_failure_classification)
## death_event serum_creatinine ejection_fraction
## 1 1 1.9 20
## 2 1 1.1 38
## 3 1 1.3 20
## 4 1 1.9 20
## 5 1 2.7 20
## 6 1 2.1 40
train_test_split <- createDataPartition(heart_failure_classification$death_event, p = 0.7, list = FALSE, times = 1)
classification_train <- heart_failure_classification[train_test_split, ]
classification_test <- heart_failure_classification[-train_test_split,]
head(classification_train)
## death_event serum_creatinine ejection_fraction
## 2 1 1.1 38
## 3 1 1.3 20
## 4 1 1.9 20
## 5 1 2.7 20
## 7 1 1.2 15
## 8 1 1.1 60
head(classification_test)
## death_event serum_creatinine ejection_fraction
## 1 1 1.9 20
## 6 1 2.1 40
## 9 1 1.5 65
## 10 1 9.4 35
## 11 1 4.0 38
## 16 1 1.3 50
Let us examine the distribution of death_event and its relationship with serum_creatinine and ejection_fraction:
heart_failure_data |>
  ggplot(aes(x = ejection_fraction, y = serum_creatinine, color = factor(death_event))) +
  geom_point() +
  labs(color = "death_event")
There seems to be a relative split in serum creatinine and ejection fraction between the patients who died and those who survived. This leads me to believe that these two more strongly correlated variables could make good predictors.
We can use individual density plots to check whether there is a good decision split for both serum_creatinine and ejection_fraction:
heart_failure_data |>
ggplot(aes(x = serum_creatinine, fill = factor(death_event))) +
geom_density(alpha = 0.5) +
labs(x = "Serum Creatinine", y = "Density", fill = "Death Event") +
ggtitle("Density Plot of Serum Creatinine by Death Event")
heart_failure_data |>
ggplot(aes(x = ejection_fraction, fill = factor(death_event))) +
geom_density(alpha = 0.5) +
labs(x = "Ejection Fraction", y = "Density", fill = "Death Event") +
ggtitle("Density Plot of Ejection Fraction by Death Event")
These plots suggest that both variables separate the death_event classes to a useful degree.
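The visual separation can also be checked numerically. A quick summary, assuming dplyr is attached:

```r
# Median of each candidate predictor within the survived (0) and died (1) groups
heart_failure_data |>
  group_by(death_event) |>
  summarise(median_serum_creatinine = median(serum_creatinine),
            median_ejection_fraction = median(ejection_fraction))
```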
Since we have no prior knowledge about these variables, we use rstanarm's default weakly informative priors:
log_reg <- stan_glm(
death_event ~ serum_creatinine + ejection_fraction,
data = classification_train,
family = binomial,
chains = 4,
iter = 5000 * 2,
seed = 84375)
##
## SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000111 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.11 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.375071 seconds (Warm-up)
## Chain 1: 0.441397 seconds (Sampling)
## Chain 1: 0.816468 seconds (Total)
## Chain 1:
## (Sampling logs for chains 2-4 are analogous and omitted; each chain finished in under a second.)
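As an aside, the verbose sampler log above can be suppressed with the `refresh` argument; this changes only the console output, not the fit. A minimal variant of the same call:

```r
# Same model as above, but without per-iteration progress messages
log_reg <- stan_glm(
  death_event ~ serum_creatinine + ejection_fraction,
  data = classification_train,
  family = binomial,
  chains = 4,
  iter = 5000 * 2,
  seed = 84375,
  refresh = 0)  # suppress the "Iteration: ..." log
```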
prior_summary(log_reg)
## Priors for model 'log_reg'
## ------
## Intercept (after predictors centered)
## ~ normal(location = 0, scale = 2.5)
##
## Coefficients
## Specified prior:
## ~ normal(location = [0,0], scale = [2.5,2.5])
## Adjusted prior:
## ~ normal(location = [0,0], scale = [2.48,0.22])
## ------
## See help('prior_summary.stanreg') for more details
Now for the MCMC diagnostics:
mcmc_trace(log_reg)
mcmc_dens_overlay(log_reg)
mcmc_acf(log_reg)
neff_ratio(log_reg)
## (Intercept) serum_creatinine ejection_fraction
## 0.93615 0.83555 0.82960
post_samp <- data.frame(log_reg)
rhat(c(post_samp$`X.Intercept.`))
## [1] 1.00007
rhat(c(post_samp$serum_creatinine))
## [1] 1.000126
rhat(post_samp$ejection_fraction)
## [1] 1.000041
pp_check(log_reg) +
ggtitle("Posterior Predictive Check Logistic Regression")
The results look good. All of the MCMC diagnostics are in line with what we expect: the trace plots mix well, the chains' densities overlap, autocorrelation drops off quickly, the effective sample size ratios are high, and the R-hat values are essentially 1.
Further, the pp_check shows that the observed y follows the same trend as the replicated y_rep, meaning the model fits the data reasonably well. The posterior predictive distribution has peaks at 0 and 1, which is exactly what we expect for the outcomes of a logistic regression model.
Let’s do a tidy summary of this model:
tidy(log_reg, c("fixed","aux"), conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.801 0.595 0.0156 1.58
## 2 serum_creatinine 0.629 0.180 0.413 0.878
## 3 ejection_fraction -0.0654 0.0163 -0.0869 -0.0448
## 4 mean_PPD 0.343 0.0421 0.290 0.395
The tidy summary gives the coefficient estimates of the logistic regression model on the log-odds scale. For serum_creatinine, holding the other predictor constant, a one-unit (1 mg/dL) increase is associated with a typical increase of about 0.63 in the log odds of death, making a death event more likely. The ejection_fraction estimate is about -0.065 with a small standard error: each additional percentage point of ejection fraction typically lowers the log odds of death by about 0.065.
When both predictors are 0, the log odds of death is the intercept, about 0.80 (though predictor values of 0 lie outside the observed range).
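The log-odds coefficients are easier to interpret as odds ratios after exponentiating; using the point estimates from the tidy summary above:

```r
# Odds ratios per one-unit increase in each predictor
exp(0.629)   # serum_creatinine: ~1.88, i.e. ~88% higher odds of death per 1 mg/dL
exp(-0.0654) # ejection_fraction: ~0.94, i.e. ~6% lower odds per percentage point
```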
We now make posterior predictions on the test data, using a threshold of 0.5 on the mean posterior prediction to classify a death event (there are more principled ways to choose a threshold, but 0.5 is a reasonable default).
posterior_predictions_log_reg <- posterior_predict(log_reg, newdata = classification_test)
post_predict_df <- data.frame(posterior_predictions_log_reg)
mean_predictions <- apply(posterior_predictions_log_reg, 2, mean)
predicted_classes_lr <- ifelse(mean_predictions > 0.5, 1, 0)
predicted_classes_lr
## 1 6 9 10 11 16 22 30 35 36 38 41 42 44 49 50 56 58 59 68
## 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0
## 70 72 74 77 80 81 82 87 88 95 97 99 100 101 104 105 107 111 112 116
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 117 119 122 125 131 133 136 137 138 148 149 160 166 167 172 173 174 176 179 183
## 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
## 190 191 197 203 217 219 224 226 227 232 235 236 238 244 245 248 251 253 255 259
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## 262 265 271 275 278 289 290 291 293
## 0 0 0 0 0 0 0 0 0
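Averaging the 0/1 draws from posterior_predict estimates each patient's probability of death, but posterior_epred returns those event probabilities directly, with less simulation noise. A sketch using the fitted model above:

```r
# Posterior mean of P(death_event = 1) for each test patient
epred_draws <- posterior_epred(log_reg, newdata = classification_test)
mean_probs  <- colMeans(epred_draws)
predicted_classes_epred <- ifelse(mean_probs > 0.5, 1, 0)
```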
align_predict_lr <- as.data.frame(predicted_classes_lr)
predictions_log <- factor(align_predict_lr$predicted_classes_lr, levels = c(0, 1))
valid_actual_labels <- factor(classification_test$death_event, levels = c(0, 1))
conf_matrix_log_reg <- confusionMatrix(predictions_log, valid_actual_labels)
conf_matrix_log_reg
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 63 13
## 1 2 11
##
## Accuracy : 0.8315
## 95% CI : (0.7373, 0.9025)
## No Information Rate : 0.7303
## P-Value [Acc > NIR] : 0.017746
##
## Kappa : 0.4998
##
## Mcnemar's Test P-Value : 0.009823
##
## Sensitivity : 0.9692
## Specificity : 0.4583
## Pos Pred Value : 0.8289
## Neg Pred Value : 0.8462
## Prevalence : 0.7303
## Detection Rate : 0.7079
## Detection Prevalence : 0.8539
## Balanced Accuracy : 0.7138
##
## 'Positive' Class : 0
##
The accuracy on the test set is about 0.83, above the no-information rate of 0.73, although the specificity of 0.46 shows the model still misses many of the actual deaths.
AUC:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:posterior':
##
## var
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_lr <- roc(response = classification_test$death_event, predictor = align_predict_lr$predicted_classes_lr)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_lr)
auc_value
## Area under the curve: 0.7138
The AUC is about 0.71. (Because the ROC here is computed from hard 0/1 predictions, this value coincides with the balanced accuracy from the confusion matrix.)
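The ROC curve above is built from hard 0/1 predictions, which gives only a single operating point. Passing the continuous mean posterior predictions instead traces the full curve and usually yields a more informative AUC; a sketch reusing mean_predictions from earlier:

```r
# ROC/AUC on posterior mean predictions rather than thresholded classes
roc_lr_prob <- roc(response = classification_test$death_event,
                   predictor = mean_predictions)
auc(roc_lr_prob)
```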
The naive Bayes classifier follows a similar process to the logistic regression model: first we train it on the training data, then make predictions on the test set.
library(e1071)
naive_baye_model <- naiveBayes(death_event ~ serum_creatinine + ejection_fraction, data = classification_train)
naive_baye_predictions <- predict(naive_baye_model, newdata = classification_test, type="raw")
head(naive_baye_predictions)
## 0 1
## [1,] 4.417551e-01 0.5582449
## [2,] 7.171903e-01 0.2828097
## [3,] 8.103945e-01 0.1896055
## [4,] 1.096566e-18 1.0000000
## [5,] 1.990143e-02 0.9800986
## [6,] 8.628879e-01 0.1371121
We have predicted probabilities for both classes, so let's take the class with the larger probability as the test prediction:
predicted_class_nb <- ifelse(naive_baye_predictions[, 2] > 0.5, 1, 0)
aligned_nb <- data.frame(predicted_class_nb)
# Convert predictions to a factor
predicted_factor_nb <- factor(predicted_class_nb, levels = c(0, 1))
predicted_factor_nb
## [1] 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [77] 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
Now we can evaluate like the other model and check the accuracy:
conf_matrix_nb <- confusionMatrix(predicted_factor_nb, valid_actual_labels)
conf_matrix_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 63 17
## 1 2 7
##
## Accuracy : 0.7865
## 95% CI : (0.6869, 0.8663)
## No Information Rate : 0.7303
## P-Value [Acc > NIR] : 0.140403
##
## Kappa : 0.325
##
## Mcnemar's Test P-Value : 0.001319
##
## Sensitivity : 0.9692
## Specificity : 0.2917
## Pos Pred Value : 0.7875
## Neg Pred Value : 0.7778
## Prevalence : 0.7303
## Detection Rate : 0.7079
## Detection Prevalence : 0.8989
## Balanced Accuracy : 0.6304
##
## 'Positive' Class : 0
##
roc_nb <- roc(response = classification_test$death_event, predictor = predicted_class_nb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_nb <- auc(roc_nb)
auc_nb
## Area under the curve: 0.6304
The AUC is about 0.63.
Both the AUC and the accuracy are slightly worse than those of the logistic regression model. Let's take a look at the random forest algorithm.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf_model <- randomForest(death_event ~ serum_creatinine + ejection_fraction, data = classification_train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
This warning appears because death_event is stored as a numeric 0/1 variable, so randomForest fits a regression forest; below we treat its numeric predictions as probabilities of a death event.
rf_predictions <- predict(rf_model, newdata = classification_test)
head(rf_predictions)
## 1 6 9 10 11 16
## 0.993051534 0.820481442 0.166901060 0.308214786 0.538257605 0.201458028
predicted_class_rf <- ifelse(rf_predictions > 0.5, 1, 0)
aligned_rf <- data.frame(predicted_class_rf)
# Convert predictions to a factor
predicted_factor_rf <- factor(predicted_class_rf, levels = c(0, 1))
predicted_factor_rf
## 1 6 9 10 11 16 22 30 35 36 38 41 42 44 49 50 56 58 59 68
## 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1
## 70 72 74 77 80 81 82 87 88 95 97 99 100 101 104 105 107 111 112 116
## 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0
## 117 119 122 125 131 133 136 137 138 148 149 160 166 167 172 173 174 176 179 183
## 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1
## 190 191 197 203 217 219 224 226 227 232 235 236 238 244 245 248 251 253 255 259
## 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1
## 262 265 271 275 278 289 290 291 293
## 0 0 1 0 0 0 0 0 0
## Levels: 0 1
conf_matrix_rf<- confusionMatrix(predicted_factor_rf, valid_actual_labels)
conf_matrix_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 12
## 1 13 12
##
## Accuracy : 0.7191
## 95% CI : (0.6138, 0.8093)
## No Information Rate : 0.7303
## P-Value [Acc > NIR] : 0.6458
##
## Kappa : 0.2961
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8000
## Specificity : 0.5000
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.4800
## Prevalence : 0.7303
## Detection Rate : 0.5843
## Detection Prevalence : 0.7191
## Balanced Accuracy : 0.6500
##
## 'Positive' Class : 0
##
roc_rf <- roc(response = classification_test$death_event, predictor = predicted_class_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- auc(roc_rf)
auc_rf
## Area under the curve: 0.65
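The regression warning earlier arose because death_event is numeric. Converting it to a factor fits a classification forest, whose predict() returns class labels directly (or class probabilities with type = "prob"). A sketch, not run here:

```r
# Classification-mode random forest on a factor response
rf_class <- randomForest(factor(death_event) ~ serum_creatinine + ejection_fraction,
                         data = classification_train)
rf_class_labels <- predict(rf_class, newdata = classification_test)
rf_class_probs  <- predict(rf_class, newdata = classification_test, type = "prob")
```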
Comparing the three models, all trained on the same training data and evaluated on the same test data: logistic regression achieved a test accuracy of 0.8315 and an AUC of 0.7138, naive Bayes 0.7865 and 0.6304, and the random forest 0.7191 and 0.65.
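Collecting the test-set metrics reported above into a single table:

```r
# Accuracy and AUC from the three confusion matrices and ROC curves above
model_comparison <- data.frame(
  model    = c("logistic regression", "naive Bayes", "random forest"),
  accuracy = c(0.8315, 0.7865, 0.7191),
  auc      = c(0.7138, 0.6304, 0.6500))
model_comparison
```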
In conclusion, the logistic regression model performs best under these conditions (same training set, same test set, same seed). It has the highest AUC, meaning it is the best of the three at ranking patients who died above patients who survived, i.e. at distinguishing true positives from false positives.
In this analysis, a few things can be summarized:
I conducted both regression and classification tasks using Bayesian and other techniques. The regression portion consisted of predicting serum_creatinine from patients' age and whether or not they had high blood pressure. The classification portion consisted of predicting whether or not a patient survived the follow-up period, given their serum creatinine and ejection fraction levels.
The predictor variables were chosen from a correlation matrix: I picked the variables that visually showed the greatest correlation with the response, whether positive or negative. There were only a few such variables, so the choice was easy.
For the regression task, I used a stan_glm model with default priors; since I had no background knowledge of the topic, it would have been risky to specify informative priors, so I used the defaults. The predictors I ended up choosing were age and high_blood_pressure, because they had moderate correlation with the response in the correlation matrix. I then tested the model without an interaction term against the model with one. Both models performed poorly, as the relevant correlations were weak. The interaction term was tested and numerically shown to be unnecessary: the interaction model performed worse in cross-validation and loo_compare.
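For reference, the interaction-term comparison described above follows the standard rstanarm/loo pattern; a sketch with hypothetical model names fit_main and fit_interaction (the actual fits are in the regression section):

```r
# Leave-one-out comparison of the main-effects vs interaction model
loo_main <- loo(fit_main)
loo_int  <- loo(fit_interaction)
loo_compare(loo_main, loo_int)  # top row has the better expected predictive fit
```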
For the classification task, I tested three classification models on the death_event variable: logistic regression (fit with rstanarm), a naive Bayes classifier, and a random forest. For a fair comparison with the non-stan_glm models, the data was split into a training set and a test set. The test set served as the benchmark for comparing the models using accuracy from a confusion matrix and AUC (area under the ROC curve). Overall, I concluded that the logistic regression model had the greatest predictive capability, with the highest accuracy and AUC on the test data: a test accuracy of about 0.83 and a test AUC of about 0.71.
Future work on this dataset could include feature selection or feature scaling to improve results. Stratifying the analysis by sex might also yield more informative predictions for heart failure outcomes. This preliminary analysis has produced good results, and other models may yet perform better on the data depending on their mathematical assumptions.
Credit:
Dataset: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records
On my honor, I have neither received nor given any unauthorized assistance on this project, Alan Wu (208000574)